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Abstract 

The epidemic spreading on arbitrary complex networks is studied in SIR (Susceptible Infected 
Recovered) compartment model. We propose our implementation of a Naive SIR algorithm for 
epidemic simulation spreading on networks that uses data structures efficiently to reduce running 
time. The Naive SIR algorithm models full epidemic dynamics and can be easily upgraded to 
parallel version. We also propose novel algorithm for epidemic simulation spreading on net- 
works called the FastSIR algorithm that has better average case running time than the Naive SIR 
algorithm. The FastSIR algorithm uses novel approach to reduce average case running time by 
constant factor by using probability distributions of the number of infected nodes. Moreover, 
the FastSIR algorithm does not follow epidemic dynamics in time, but still captures all infection 
transfers. Furthermore, we also propose an efficient recursive method for calculating probability 
distributions of the number of infected nodes. Average case running time of both algorithms 
has also been derived and experimental analysis was made on five different empirical complex 
networks. 

Keywords: SIR compartment model. Epidemic spreading simulation. Computational 
epidemiology 



1. Introduction 

Complex networks represent structure of communication networks HI] 101 or social contact 
interactions very well. Therefore, it is reasonable to study computer virus propagation 

or epidemic spreading on complex networks lUt] |0] Modeling the spread of an epidemic 
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in a population is usually done by dividing individuals of the population into subdivision with 
some common characteristic features called compartments. The SIR model is a good model for 
many infectious diseases where each individual in a population can be in one of three different 
compartments. Those who are susceptible to the disease are in the Susceptible compartment, 
those who are infected and can transmit the disease to others are in the Infected compartment 
and those who have recovered and are immune and those who are removed from population 
are in the Recovered compartment. Some infectious diseases are described with models that 
have different number of compartments like SIS model (Susceptible Infected Susceptible) where 
individuals can not have long lasting immunity and therefore Recovered compartment does not 
exist. 

Different mathematical models have been used to study epidemic spreading. Under the as- 
sumption of homogeneous mixing among individuals inside different compartments (Kermack- 
McKendrick model) differential equations can be applied to understand epidemic dynamics ["I^l- 
Contact network epidemiology apphes bond percolation on random graphs (not on arbitrary 
structure) to model epidemic spreading on heterogeneous population (epidemic dynamics is ne- 
glected) (l^ 1 14] Small world network property and scale-free network property jgl] 
11 101 have great impact on epidemic spreading outcome. Scale-free complex networks exhibit no 
epidemic threshold below which the infection cannot produce epidemic outbreak (endemic state) 
in SIS model 

Realistic epidemic simulations (EpiFast |fl6l]. EpiSims UtIi and EpiSindemics |18|) have be- 
come very important application of high-performance computing in epidemic predictions. These 
are just a few examples of parallel algorithms that can be used in public health studies. Some 
studies ifigll lEoll used contact network models between urban cities (cities are connected through 
airline transportation network) and homogeneous mixing model inside urban cities and examined 
influence of interventions (antiviral drugs and containments) to worldwide spread of pandemic. 

Recently, the phase diagrams of epidemic spreading with the SIR model on complex net- 
works were introduced as a useful tool for epidemic spreading analysis lEItl . To predict expected 
value of the number of infected nodes in epidemic spreading on arbitrary network, we should 
repeat simulations till standard deviation of the number of the infected nodes sufficiently reduces 
(in unimodal part of the epidemic phase diagram) or converges to nonzero value (in bimodal part 
of the epidemic phase diagram). Due to the high number of iterations needed to predict expected 
epidemic outcome, average case running time of the epidemic simulation algorithm should be 
as low as possible. In this paper we describe our implementation of a Naive SIR algorithm to 
simulate epidemic spreading (SIR model) on arbitrary network structure with low average case 
running time. Our implementation of the Naive SIR algorithm uses data structures efficiently 
to reduce running time. This algorithm models full dynamics of epidemic spreading and can be 
upgraded easily to parallel version. Main contribution of this paper is a novel algorithm called 
the FastSIR algorithm which uses probability distributions of the number of infected nodes to 
speed up recovery of infected nodes to one discrete step during simulation of epidemic spreading 
to reduce running time. Moreover, the FastSIR algorithm does not follow epidemic dynamics in 
time, but still captures all infection transfers. 

In section |2] we formally define an epidemic simulation problem and other concepts used 
in this paper. Section |3] describes our implementation of Naive SIR algorithm along with the 
running time and the space complexity analysis. In section |4] we describe our novel FastSIR 
algorithm along with the running time and the space complexity analysis. Correctness of the 
FastSIR algorithm was proven with four steps of equality. We also described how to efficiently 
implement probability distributions of the number of infected nodes by a recursive method. In 
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section |5] we described results of the performance profiling and analysis of our algorithms on 
five empirical complex networks. In sections |6] and [7] we described possible applications of our 
algorithms along with discussion of results and conclusion. 

2. Epidemic simulation problem 

We define contact-network as undirected and non-weighted graph G{N, L) (N-set of nodes, 
L-set of links). Link (u, v) exists only if two nodes u and v are in contact during epidemic time. 
We also assume that the contact-network during epidemic process is static one, but the algorithms 
and other results can also be applied even for dynamic networks. To simulate epidemic propa- 
gation through contact-network, we use standard stochastic SIR model. In this model each node 
at some time can be in one of the following states: susceptible (S), infected (I) and recovered 
(R). A discrete time model is used. Time needed for the epidemic to stop spreading defines one 
epidemic simulation. At the beginning of each epidemic simulation all nodes from graph G are 
in susceptible state except arbitrary set of nodes, which are initially infected, with some epidemic 
parameters p and q. These initial conditions we denote with letter A. Epidemic parameter /? is a 
probability that an infected node u infects adjacent susceptible node v in one discrete time step. 
Epidemic parameter ^ is a probability that an infected node recovers in one discrete time step. 
At the end of an epidemic simulation all nodes can be in one of two following states: susceptible 
or recovered. Epidemic simulation on (G, A) is a random instance of epidemic stochastic pro- 
cess ESP{G,A). Epidemic simulations are mutually independent. Let X be a random variable 
that measures a number of infected nodes of epidemic stochastic process ES P{G, A). Given a 
ES P(G, A), the epidemic simulation problem is to compute first n moments of random variable 
X. We will also often use the notation P {X„ - k), which denotes the probability that the infected 
node infects k neighbours out of total n susceptible neighbours in the limit of the time. This 



3. The Naive SIR algorithm 

In standard algorithm for SIR model, an infected node tries to infect its neighbors sequen- 
tially. For each neighboring node a pseudo random number between and 1 is calculated. If the 
number is smaller or equal to p value, the neighboring node is infected. At the end we check if 
the node recovers according to a new pseudo random number and q parameter. In this paper we 
call this algorithm the Naive SIR algorithm. 

In our implementation (see Algorithm [1]) we use a queue for the set of infected nodes / and 
an array structure S for indication of susceptible nodes. If the array value of particular node is 
"1" that node is susceptible. Vice versa, the node is infected or recovered. The network was 
represented using an adjacency list. 

3.1. Time and space complexity analysis of the Naive SIR algorithm 

Here, we examine the average case running time and space complexity of the Naive SIR 
algorithm. For order of growth of average case running time algorithm analysis we use standard 
big-(9 notation (asymptotic upper bound within a constant factor) I.3Q1 . 
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Algorithm 1 The Naive SIR algorithm 



Input: (G, A) where G is contact network and A represents the initial conditions. Initial con- 
ditions consist of p, q, I a queue of initally infected nodes and S (v) is an array indicator of 
susceptible nodes. 

Output: array indicator of recovered nodes R{v) 
while / is not empty do 
dequeuef u, I) 

for each contact v of node u do 
if S (v) is equal to 1 then 

let transmission of infection m => v occur with probability p 
if M => V does occur then 
update S(v) and R(v) 
enqueue ( v, I) 
end if 
end if 
end for 

update state of u from infected to recovered with probability q 

if u is not recovered then 
enqueuef u, I) 

end if 
end while 
output R(v) 



The average case running time of the Naive SIR algorithm TdK [X] , k, q) is equal to: 



T,mX],k,q)^0 



q 



(1) 



where E {X} denotes total expected number of infected nodes and k denotes average degree. 

To explain this statement, let us start with the case of one infected node with k neighbors. In 
one cycle it tries to transmit infection to each of its neighbors. The run-time calculation cost of 
that is proportional to k. At the end of each cycle a random number is compared with q. If the 
number is greater than q the node is moved to set of recovered nodes. Total running time cost T'^ 
for some infected node v,- is sum of costs over all time steps where node v, was infected. Hence, 
it can been seen that number of cycles the node is in infected state is a sample from geometric 
distribution with expectation 1 jq. Because of that, total average running time for one infected 
node is Tl - 0(k/q). Let E [X] be the expected number of infected nodes in the network. 
Because main while loop of the Naive SIR algorithm executes sequentially total average case 
running time Tc is sum of T'^ for all infected nodes v,-. The sum Tc = +T]: + ... + T" has E [X] 
terms. Therefore average case running time Tc is (9(E [X] k^). 

For a network with cycles, it is difficult to analytically calculate the expected number of 
infected nodes, but we can calculate it for a regular m-arry tree. To that end, we will use X„, 
random variable of a number of directly infected susceptible nodes by the infected node of degree 



n Hill]. It can be easily verified that E [X„] = riE [Xi] = nP (Xi = 1). 



Proposition 3.1. The average case running time of the Naive SIR algorithm Tc(¥. [T„] , k) for a 
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m-arry tree of depth n is equal to: 

7;(E[r„] J) = o(E[r„]fc), 

where Tn is a random variable that measures time needed for epidemic to stop spreading in 
regular m-arry tree of depth n and k denotes average degree. 
In particular the expectation ofT„ satisfies the relation: 

1 [mP(Xi = 1)]" - 1 

E r„ «C — 

^ ^ q mP (Xi = 1)- 1 



where the expression ^'"j^, ' = E [X] is the expected total number of infected nodes 121 1. 

Proof. The upper bound on the expected value of a random variable T„ is calculated by ap- 
plying law of total expectation with partition {X,„ - i :Q ^ i ^ m); in case X„ = follows 



E[T„\X,„^0] = E[ro] 



and otherwise E [T„ \X,„ ^ k] ^ E [Tq] + E 



max T 



■(;) 

n-l 



the 



expected value of time needed for epidemic to stop spreading at root level plus the expected 
value of maximum of times needed for epidemic to stop spreading in each of k subtrees with 
depth n-l, where k = 1,2, m 

m 

E [T„] = y E [Tn |X„ = /] P (X„ = < E [Tq] P (X,„ = 0) + 



1=0 



/ 



max T 



■(;■) 

n-l 



I + Ve 
^ it 



max T 
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- + E [r„_i] y / ■ p (x„, = = - + E [r„_i] E [x„,] 



!=1 

= - +mE[r„_i]P(Xi = 1) 
1 



1 mP(Xi = 1)"- 1 

E r„ sC — ! 

q mP(Xi = 1)- 1 



The space complexity S of the Naive SIR algorithm with respect to the number of links L 
and the number of nodes is equal to: 



5[L,A?] 



2L + N 

G / 



S(v) «(..) 



2L + 3N, 



where the first term denotes space complexity of contact network G (adjacency list), the second 
term denotes space complexity of a queue of infected nodes /, the third term denotes space com- 
plexity of an array indicator of susceptible nodes S (v) and the last term denotes space complexity 
of an array indicator of recovered nodes R{v). Note, that S (v) and R{v) can be implemented as a 
bitset structure to further reduce memory consumption. 

In connected networks L ^ N and then the space complexity S of the Naive SIR algorithm 

is: 

S[L,N] = 0(1). 
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4. The FastSIR algorithm 

The main goal of this article is to find a faster algorithm (see Algorithm |2ll for determining 
the total number of infected nodes in epidemic, in which the stochastic simulation of epidemic 
spreading dynamics is not used explicitly. Looking at complexity of Algorithm[T] we can see that 
possible speed up of sequential version of the algorithm can be obtained only by reducing the 
Ijq part. Since we know how to calculate the probability distributions for the number of infected 
nodes [2_1,], the idea is to choose that number from distribution. The probability that the infected 
node infects k neighbours out of total n susceptible neighbours in the limit of the time is: 



For the calculation of cumulative distribution C„(k) - V{X„ < k), p, q and k should be known. 
These values can be calculated on the fly, but they can also be calculated in advance and saved 
on disk. In that case we do not need to repeat calculation for the same k values. Furthermore, 
since we use a few thousand simulations for each 3-tuple p, q, k, it is easy to see a benefit 
of the precalculated distributions. Distributions should be precalculated only once and can be 
used for several networks. The cost of calculation of a distributions for each k up to some 
kma.x is proportional to k„,ax^- However, the benefit of precalculation is evident in cases when 
it is necessary to run simulation using different starting parameters |28]. Furthermore, since 
contact social networks usually have fe^ax up to tens of thousand, it is necessary to precalculate 
distribution once for all of them. 



Algorithm 2 The FastSIR algorithm 

Input: (G, A, C) where G is contact network and A represents the initial conditions, C is cu- 
mulative distribution for p, q and all k values in the network. Initial conditions consist of p, q, 
I a queue of initially infected nodes and S (v) is an array indicator of susceptible nodes. 
Output: array indicator of recovered nodes R(v) 
while / is not empty do 
dequeuef u, I) 

draw a pseudo random value r 

find from C(k, p,q) a maximal value of ki such that C{ki ,p,q) < r, where ki is number of 

infected neighbors 

draw from k neighbors ki nodes w 

for each w do 

if S (w) is equal to 1 then 
update S (w) and R(w) 
enqueue (w, /) 
end if 



A distinction between simulations in the Naive SIR algorithm and the FastSIR algorithm is 
in the parameter that orders the execution of the simulation. For Naive SIR the simulation is 
ordered in (discrete) time: the simulation follows the dynamics of infection transfer as it unfolds 




(2) 



end for 
end while 

output R(v) 
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in time. In the case of FastSIR, the parameter ordering the execution of the simulation is the 
parameter that we call the generation index. All infected nodes can be classified into generations 
according to number of infection transfers from the initially infected node. In particular, the 
initially infected node has a generation index 0, the nodes that it infects have the generation 
index 1 and so on. In FastSIR, the simulation starts from the initially infected node (generation 
0) and using probability distributions for the number of infected nodes, nodes from generation 1 
are determined and the node from generation is recovered. In the n-th step of the simulation, 
starting from the nodes from generation n - 1, the nodes from generation n are determined using 
probability distributions for the number of infected nodes. Then the nodes from the generation 
n - 1 are recovered and the simulation proceeds to the next step. Essentially, as a stochastic 
process, FastSIR captures all infection transfers happening in Naive SIR using different ordering 
(generation versus time). 

4.1. Correctness of the FastSIR algorithm 

To see correctness of the FastSIR algorithm (see Algorithm |2ll we change Naive algorithm 
in a couple of steps that guarantee equality with respect to all infection transfers happening in 
Naive SIR process. 

• First, all nodes infected directly by initially infected nodes can not recover nor infect their 
neighbors until the last of the initially infected nodes is recovered. Then, process is re- 
peated so that all infected nodes in moment of recovery of the last initially infected node 
are defined as the initially infected nodes. It is clear that in this way the probability of in- 
fection of any neighbor of initially infected nodes directly by any of initially infected nodes 
remains unchanged. Since all probabilities of direct transfer of infection remain unchanged 
until the end of the algorithm, we conclude that this modification leaves a probability of 
infection of any node unchanged. 

• The second step differs from the first step of modifying the Naive SIR algorithm in the 
way that all nodes except the initially infected node cannot recover nor infect their neigh- 
bors until the chosen initially infected node is recovered. Then we choose another node 
that plays the role of the initially infected node and repeat the process. Probabilities of 
direct infection of any of the neighbors of initially infected nodes are not changed because 
the probability of transmission of infection from each initially infected node to any of its 
neighboring nodes remained the same, and the order in which we have chosen initially in- 
fected nodes does not affect the probability of infection of susceptible neighbors of initially 
infected nodes. 

• The third step is the reduction of all the steps of recovery of chosen initially infected 
node to a single step; if the initially infected node has m susceptible neighbors, by using 
a distribution of a random variable of infection we can determine the realization of the 
number of infected nodes, and then realization of that number of infected nodes among the 
susceptible m. The probability of direct transmission of infection remains unchanged due 
to the construction of a random variable of infection. 

• Fourth step uses the principle which we prove in the following proposition. It states that 
in the previous step, number n of adjacent nodes can be taken instead of number m of 
susceptible nodes, as long as only susceptible adjacent nodes are infected in the process. 
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Proposition 4.1. Let node v have n neighbors of which s\, . . .,s„ are susceptible and im+i, ■ ■ ■ ,in 
cannot be infected, and let Y be a random variable of number of nodes infected directly by 
node V. Alternatively, let node v have the same n neighbours s\, . . . , s„,, i„,+i, . . . ,in which are 
susceptible, and let Z be a random variable of number of nodes infected directly by node v 
among nodes si, . . . , s„,. Then Y i Z are identically distributed random variables. Furthermore, 
in both instances node si has the same probability of being directly infected by node v for all i, 
I ^ i !^ m. 

Proof. Let node v have m susceptible neighbours and degree n. Probability that k out of m 
susceptible neighbours end up being infected by node v is obviously P (X,„ = k). Probability that 
by infecting n neighbours, out of which m can be infected and n - m can not, is obtained as 
follows: 

Probability that / predetermined nodes out of n susceptible nodes become infected is P* (X„ = /). 
We know that only m out of n nodes are actually susceptible, so we have to choose / out of that k 
nodes which are in the set of m susceptible nodes, and remaining i - kin the set of n - m which 
can not be infected. That can be done in different ways. Probability that in the end there 

are going to be k infected nodes in the set of m susceptible nodes is 



The only thing left to do is compare these two expressions . We will use following relations: 




(3) 




(4) 




(5) 




(6) 
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n-m+k 

and obviously ([8) 

n-m+k 

By taking m = 1 in both instances in equations ((3) and (|8) we obtain the same probability of 
node Sj being directly infected by node v for all /, 1 ^ i ^ m. 



n - m 
i-k 



(7) 



(n — in\ 



4.2. Time and space complexity analysis of the FastSIR algorithm 

Here, we examine the average case running time and space complexity of the FastSIR al- 
gorithm. For order of growth of average case running time algorithm analysis we use standard 
big-(9 notation (asymptotic upper bound within a constant factor) [30]. 

Proposition 4.1. The average case running time of the FastSIR algorithm Tf is equal to: 

tJ- ^ 0{¥.[X]k), 

where E [X] denotes total expected number of infected nodes and k average degree in network. 

Proof. Let us start with one infected node and its k (degree) susceptible neighbors. Since dis- 
tribution of number of infected neighbors is precalculated and it is possible to access to data 
with Oi l) we can neglect that to overall cost. The first step is uniformly choosing a value for a 
cumulative distribution. Since the parameters of p, q and k are known we should find appropriate 
number of infected nodes ki for that realisation. From the fact that there are k+\ possible values 
we can find k\ in log(k) steps using binary search algorithm. In the next step, a random sample 
of ki nodes should be chosen, that would be infected, from k of them. For that operation, calcu- 



lation cost is proportional to min{k\ ,k — k\) 112211 . In the last step, infection should be transmitted 



to k\ neighboring nodes. So the calculation cost is proportional to k\. The overall running time 
for one infected node r| and k susceptible neighbors can be calculated from the sum of costs for 
all three steps and it is equal to 
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r]. = cilog(k) + C2min(ki,k - ki) + Ciki = Oik) (9) 

where ci, C2 and C3 are constants. Since ki < k, using big O notation it can be seen that 
average case running time for one node is 0{k). Hence, it does not depend on l/q. Total average 
case running time is the sum of average times Tj for all infected nodes v, because main while 

loop of the FastSIR algorithm (see Algorithm|2]i executes sequentially. This sum Tj + Tj + ... + T'^ 
has E [X] terms which have 0{k) average case running time. Therefore average case running time 
77 is equal to the O ( E [X] k) . 

If asymptotic running times of FastSIR and Naive algorithm are compared, it can be seen 
that asymptotic upper bound average case running time of FastSIR is 1 /<7 times lower. However, 
if we look in more detail in the running time for the case when q equals one and we have an 
infected node and k of its neighbors, their asymptotic upper bounds of average running times are 
the same. But, it can be seen that there is a run-time difference. Let that running times ratio 
between FastSIR and Naive SIR algorithms be rq where r is a value that depends on epidemic 
parameters and the network structure. Let that k be big enough. In the case when q equals 
one we have r'. For FastSIR running time is equal to the sum in equation (|9]l. For the Naive 
SIR algorithm that running time is tI - c^k and r' = Tj/T^. For FastSIR the first part of 
running time is from finding appropriate ki value for obtained random value of the cumulative 
distribution, the second part is from random selection of ki neighbors, while the last part is from 
the process of transmission of infection to ki neighbors. Since, the code for the last part of the 
FastSIR algorithm is almost equal to code of the Naive SIR algorithm, we can take that C4 and 
C3 are approximately equal. Whereas k\ ^k for very small values of parameter p, it is expected 
that r' < 1. For some middle range of p values ^1 is around kjl. For that case, it can be seen that 
r' can be greater than 1 if C2 > C3. When p is near \,ki ^ k, so the middle element of the sum is 
neglected and value is around 1 or slightly greater. 

The value of ki is also dependent on q parameter In the case of same p, for smaller value of q, 
ki would be larger and vice versa. We can conclude that worst influence of epidemic parameters 
for duration of execution of the FastSIR algorithm would have p and q values for which ki ^ k/2. 

The network structure has an influence on the ratio too. Looking at the for loop part of 
both algorithms we can see branching that depends of the state (susceptible or not susceptible) 
of the neighbouring node. Although upper bound values of C3 and C4 constants can be easily 
determined, their true values depend on the network structure. If structure has form of a tree or a 
chain, the infection can be transmitted only from one direction. The values of C3 and C4 are half 
the value and over half the value of upper bound in the cases of chain and tree, respectively. When 
the network structure has a lot of cycles, the infection can be transmitted from many directions 
and values of C3 and C4 are lower. Also, it is important to emphasise that C2 is independent on 
the network structure. In accordance with above analysis the FastSIR algorithm would be slower 
than Naive Algorithm, for some specific values of epidemic parameters, if and only if C2 > C3. 
Hence, for the networks with more cycles there is a greater chance that, for some values of 
infected parameters, the Naive SIR algorithm would be faster 

The space complexity S of the FastSIR algorithm with the respect to the number of links L, 
the number of nodes and the sum of all distinct degrees in network K is equal to: 

S[L,N,K]^^2L^ + ^^+ N + N + N =2L + K + 3N, 

C C I S(v) R(v) 
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where the first term denotes space complexity of contact network G (adjacency list), the second 
term denotes space complexity of cumulative distributions C for all distinct degrees fc,- in G, the 
third term denotes the space complexity of a queue of infected nodes /, the next term denotes the 
space complexity of a vector indicator of susceptible nodes S (v) and the last term denotes the 
space complexity of a vector indicator of recovered nodes Riv). Note, that the ^(v) and R{v) can 
be implemented as a bitset structure to further reduce memory consumption. 

In connected networks L ^ N and 2L ^ K and then the space complexity S of the FastSIR 
algorithm is: 

S[L,N,K] = 0(L). 
The values of L, N and K for studied networks are presented in Table |5] 

4.3. Implementation of distribution precalculation 

Looking at the cumulative distribution formula it can be seen that calculation cost is propor- 
tional to k,„a^'^. Speed up can be achieved using the fact that binomial coefficient values and the 
fraction part of the formula are repeated so caching them we can obtain lower calculation costs. 
However, we achieved further speed up using a recursive formulafTOl 

Proposition 4.1. For each k^O 

P(x„ = ^) = Inxn-i ^k-i)- |llp(x„ ^k-l) (10) 



Proof. 
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Using this recursive formula the computation cost is proportional to k,„ax^- It is very impor- 
tant to mention that in programming of cumulative distribution one should be very careful with 
precision. Because of that we use multiple precision library for this calculation. Empirically we 
obtained that it is safe to set the precision to be at least 0.8 times degree bits. The minimum 
precision is 64 bits. During the testing of calculation time we noticed that cost for large degree 
values predominantly depended on the precision used. The cumulative distribution values should 
be precalculated for a specific maximum degree only once and they can be used for all networks 
that have degrees less then maximum one. We consider that 50000 is high enough value of de- 
gree for majority of networks. Similar recursive formula can be used when random variable of 
time of recovery for each node is distributed as negative binomial probability distribution. 

4.4. Parallelization of the algorithm 

Like in similar algorithms |16], parallelization can be performed by partition of networks 
using MPl. Since we used a large number of repetitions it can be also naively parallelized per- 
forming each repetition on a separate core. The precalculation of distribution is also naturally 
parallelizable. Parallelization using GPUs is a very challenging task and it will be the scope of 
one of our next investigations. 



5. Experimental results 

In this section, we describe some detailed performance profiling and analysis of FastSIR 
implementation on our test server The server has 4 Quad Core 2.4 GHz Intel E5330 processors 
and 50 GB of RAM memory. For test purposes we use only one core for each test. Algorithms 



are implemented in C using igraph 123,1 and gmp libraries 1124 1. 

The analysis was performed on several empirical networks: a network of 2003 condensed 
matter collaborations (cond-mat 2003) introduced in |4'], an undirected, unweighted network 
representing the topology of the US Western States Power Grid (power grid) [8J, a network of 
coauthorships between scientists posting preprints on the Astrophysics E-Print Archive between 
Jan 1, 1995 and December 31, 1999 (astro physics) ['Tj, a symmetrized snapshot of the structure 
of the Internet at the level of autonomous systems, reconstructed from BGP tables posted by the 
University of Oregon Route Views Project (Internet) iEtIi and a network of Live Journal users 
(Live Journal) ll29ll . Table[T]shows the basic information for above mentioned networks. 



Table 1: Basic network parameters 



Network 


no of nodes 


no of links 




k 


sum of distinct degrees 


Power grid 


4 941 


6 594 


19 


2.1 


142 


Cond-mat 2003 


27 519 


116 181 


202 


8.4 


8 619 


Astro physics 


14 845 


119 652 


360 


16.1 


16 737 


Internet 


22 963 


48 436 


2390 


4.2 


32 118 


Live Journal 


5 189 809 


77 365 447 


15 023 


29.6 


2 503 563 



For each analysis we measured running time of the Naive SIR algorithm and the FastSIR 
algorithm. Loading network structure data (adjacency list) from disc was not measured in run- 
ning time analysis for both algorithms. However, loading precalculated probability distributions 
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Table 2: Running time in seconds for 2000 simulations, p = 0.2, 0.5, 0.8 and q = 0. 1 



Network 


p=0.2 


p=0.5 


p=0.8 


Naive SIR 


Fasts IR 


Naive SIR 


Fasts IR 


Naive SIR 


Fasts IR 


Power grid 


3.2 


0.4 


7.0 


0.9 


7.1 


0.9 


Cond-mat 2003 


67.7 


9.6 


63.3 


8.6 


61.2 


7.9 


Astro Physics 


44.1 


7.0 


41.2 


5.9 


39.9 


5.1 


Internet 


42.5 


5.0 


41.6 


4.9 


40.5 


4.7 


Live Journal 


50 683 


6 699 


48 373 


5 635 


47 531 


5 078 



from disc was measured in running time analysis for the FastSIR algorithm. Also, we measured 
execution time for distribution precalculation. We studied the entire (p, q) parametric space of 
the SIR model: a [0, 1] x [0, 1] square. The step value for both p and q was 0.1. Each simula- 
tion was started from the same node, and it was performed 2000 times. Upper bound memory 
consumption for all experiments was 9 GB. Although some authors use only a several dozen 
of repetitions, we consider that is not enough for stable results in the bimodal part of the phase 
space. The results of running time for p values of 0.2, 0.5 and 0.8 and q value of 0.1 for all tested 
networks are presented in Table|2] In addition in Table[3]are presented results for p values of 0.2, 
0.5 and 0.8 and q values between 0.1 and 1. Graphs of results obtained for p values of 0.2, 0.5 
and 0.8 and different values of q for all networks are presented in Figure [1] Figure |2] and Figure 
|3] respectively. Those figures show ratio of running time between Naive SIR and FastSIR. 

Table 3: Running time in seconds for Live Journal network. Parameters: 2000 simulations, p = 0.2, 0.5 and 0.8, q = 0.1 
to 1 



q 


p=0.2 


p=0.5 


p=0.8 


Naive SIR 


fastSIR 


Naive SIR 


fastSIR 


Naive SIR 


fastSIR 


0.1 


50 683 


6 699 


48 373 


5 635 


47 531 


5 078 


0.2 


25 841 


7 200 


24 398 


6 314 


24 067 


5 357 


0.3 


18 550 


7 200 


16 580 


6 841 


16 276 


5 609 


0.4 


13 686 


6 987 


12 870 


7 259 


12 329 


5 843 


0.5 


13 197 


6 704 


10 394 


7 591 


9 951 


6 060 


0.6 


9 394 


6 400 


8 720 


7 859 


8 345 


6 253 


0.7 


8 301 


6 073 


7513 


8 072 


7 185 


6 429 


0.8 


8 744 


5 805 


6 622 


8 250 


6 293 


6 592 


0.9 


7 521 


5 508 


5 869 


8 555 


5 597 


6 749 


1 


5 291 


5 259 


5 064 


8 666 


5 082 


7 777 



Results differ between networks, but the trend is that the ratio is approximately proportional 
to \lq. When q value is near one, the running time ratio differs dependingly on the network and 
the value of p. In can been seen that results are in accordance with the above analysis. When p is 
small {p = 0.2), the FastSIR algorithm is faster or equal to Naive SIR for all q values. But, when 
p has value of 0.5, the Naive SIR algorithm is faster for larger q values for almost all networks. 
In addition when p has value 0.8 the Naive SIR algorithm can be faster only for some networks 
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Figure 3: Running time ratio for Naive SIR and the FastSIR algorithm. Parameters: 2000 simulations, p = 0.8, q = 0.1 
to 1 (Online version in colour). 




Astro Physics 
Cond-mat 2003 
Internet 
Power grid 
Live journal 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 



Figure 4: Running time ratio for Naive SIR and the FastSIR algorithm on parametric space on Internet network (Online 
version in colour). 
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Figure 5: Running time ratio for Naive SIR and the FastSIR algorithm on parametric space on Astro Physics network, 
white line represents border of the area of the parametric space where running time ratio is strictly less than one (Online 
version in colour). 
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P 



and q values very close to one. However, for small q values the FastSIR algorithm is much faster 
and i.e. for q value of 0. 1, the ratio is between 7 and 9.5 depending on the network and the value 
of p. 

Above analysis of the running time ratio between Naive SIR and the FastSIR algorithm can 
be summarised in Figures |4]|5]|6]|2] and [8] In these Figures we show running time ratio on entire 
parametric space {p, q) (averaged over 2000 simulations) and denote area of the parametric space 
(white line) where running time ratio is strictly less than one. It can also be seen that speed-up is 
dependent on the network structure. In accordance with analysis in Section|4] Naive algorithm is 
rarely faster than the FastSIR algorithm for networks of which structure is tree-like (Internet) or 
chain (Power grid). For networks with more cycles (i.e. Live Journal) Naive Algorithm is faster 
for more epidemic parameters. 

It is very important to emphasise that results for Live Journal network of 5 mil. nodes and 77 
mil. links are very fast. Average case running time for one simulation for p of 0.2 was between 
2 and 4 seconds for the FastSIR algorithm. The worst obtained result for entire (p, q) space 
was 5 seconds. Even results for the Naive SIR algorithm for that network did not exceed 30 
seconds for one simulation. Furthermore, it should be stressed that results are achieved without 
parallelization. Hence, implementations of both algorithms can be used for large networks. 

6. Application 

Our version of the Naive SIR algorithm models full epidemic spreading dynamics and has 
low average case running time. Therefore, the Naive SIR algorithm is a core algorithm that 
models epidemic spreading and can easily be upgraded to parallel version with interventions. 
Interventions represent all sort of measures that are done by purpose in order to influence impact 
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Figure 6: Running time ratio for Naive SIR and the FastSIR algorithm on parametric space Cond-mat 2003 network, 
white line represents border of the area of the parametric space where running time ratio is strictly less than one (Online 
version in colour). 
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Figure 8: Running time ratio for Naive SIR and the FastSIR algorithm on parametric space on Power grid network, white 
line represents border of the area of the parametric space where running time ratio is strictly less than one (Online version 
in colour). 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



of epidemic spreading like using antiviral drugs or containment measures. This algorithm has 
wide areas of application in virus propagation, epidemic or knowledge spreading with SIR model. 
The FastSIR algorithm uses precalculated probability distributions of the number of infected 
nodes to reduce average case running time by a constant factor The FastSIR algorithm has been 
used in study of the influence of the initially infected node to epidemic impacts due to its speed 
up when all nodes in network should be examined as initially infected node to calculate epidemic 
risks over entire parametric (p, q) space i28ll . This also has practical importance since the choice 
of the initially infected node may describe difference between a random outbreak and a terrorist 
attack. 

7. Discussion and conclusions 

In this paper we have described how to construct our version of the Naive SIR algorithm and 
we have proposed completely new FastSIR algorithm for epidemic spreading simulations (SIR 
model) on an arbitrary network structure with reduced running time. Running time of the Naive 
SIR algorithm was reduced by using two data structures (queue and array) simultaneously for 
storing node states. This algorithm models full epidemic dynamics and thus can be easily up- 
graded to a parallel version with interventions (antiviral drugs and containments). Upper bound 
of the average case running time of the Naive SIR algorithm on m-arry tree has been analytically 
derived. 

We propose a novel FastSIR algorithm to reduce the average case running time of the Naive 
SIR algorithm by approximately constant factor 1 Iq in one huge area of parametric space {p, q). 
We also showed that average case running time of the FastSIR algorithm is equal to total expected 
number of infected nodes times average node degree. This low average case running time was 
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accomplished by using precalculated probability distributions of the number of infected nodes 
along with binary search and a simple sampling algorithm. Correctness of this new algorithm 
has also been proven in four steps of equality of two algorithms. Precalculation of probability 
distributions of the number of infected nodes should be done with caution to avoid numerical 
errors. It is very important to mention that the FastSlR algorithm can be used in the same manner 
when time od recovery for each infected node is distributed as the negative binomial probability 
distribution. 

It should also be clear that Naive SIR and FastSlR algorithms can be merged to a Hybrid SIR 
algorithm due to four steps of equivalence between Naive SIR and FastSlR algorithms. In one 
segment of the phase diagram Naive SIR can be faster than Fast SIR algorithm. Therefore Hybrid 
SIR algorithm could use one of two algorithms depending on position of an infective agent in the 
phase diagram in order to decrease running time. For the segment of the phase diagram where 
it is expected that Naive can be faster, we can estimate duration of both algorithm for a few 
simulations and then use the faster one for all other simulations (adaptive control). 

Experimental analysis was made on five different empirical complex network on a single core 
processor To the best of our knowledge our algorithms have a far shorter total running time per 
epidemic simulation than other algorithms. Parallelization of this algorithms was not in scope 
of this research and was left for future work. Empirical results show that the FastSlR algorithm 
is approximately faster than the Naive SIR algorithm hy Ijq constant factor for a great part of 
parametric space (p, q) which is in the excellent agreement with our expectations. 
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